Travel Package Purchase Prediction - Problem Statement

Submission : Ranjan Mitra

Context:

You are a Data Scientist for a tourism company named "Visit with us". The Policy Maker of the company wants to enable and establish a viable business model to expand the customer base.

A viable business model is a central concept that helps you to understand the existing ways of doing the business and how to change the ways for the benefit of the tourism sector.

One of the ways to expand the customer base is to introduce a new offering of packages.

Currently, there are 5 types of packages the company is offering - Basic, Standard, Deluxe, Super Deluxe, King. Looking at the data of the last year, we observed that 18% of the customers purchased the packages.

However, the marketing cost was quite high because customers were contacted at random without looking at the available information.

The company is now planning to launch a new product, the Wellness Tourism Package. Wellness Tourism is defined as travel that allows the traveler to maintain, enhance or kick-start a healthy lifestyle, and support or increase one's sense of well-being.

However, this time the company wants to harness the available data of existing and potential customers to make the marketing expenditure more efficient.

As a Data Scientist at the "Visit with us" travel company, you have to analyze the customers' data to provide recommendations to the Policy Maker and the Marketing Team, and build a model to predict which potential customers will purchase the newly introduced travel package.

Objective:

To predict which customers are more likely to purchase the newly introduced travel package.

Data Description:

Customer details:

CustomerID: Unique customer ID
ProdTaken: Whether the customer has purchased a package or not (0: No, 1: Yes)
Age: Age of customer
TypeofContact: How customer was contacted (Company Invited or Self Inquiry)
CityTier: City tier depends on the development of a city, population, facilities, and living standards. The categories are ordered, i.e., Tier 1 > Tier 2 > Tier 3
Occupation: Occupation of customer
Gender: Gender of customer
NumberOfPersonVisiting: Total number of persons planning to take the trip with the customer
PreferredPropertyStar: Preferred hotel property rating by customer
MaritalStatus: Marital status of customer
NumberOfTrips: Average number of trips in a year by customer
Passport: Whether the customer has a passport or not (0: No, 1: Yes)
OwnCar: Whether the customer owns a car or not (0: No, 1: Yes)
NumberOfChildrenVisiting: Total number of children with age less than 5 planning to take the trip with the customer
Designation: Designation of the customer in the current organization
MonthlyIncome: Gross monthly income of the customer

Customer interaction data:

PitchSatisfactionScore: Sales pitch satisfaction score
ProductPitched: Product pitched by the salesperson
NumberOfFollowups: Total number of follow-ups done by the salesperson after the sales pitch
DurationOfPitch: Duration of the pitch by a salesperson to the customer

Note:

Please note XGBoost can take a significantly longer time to run, so if you have time complexity issues then you can avoid tuning XGBoost. No marks will be deducted if XGBoost tuning is not attempted.

Note: The first section of the notebook covers material already seen several times in the previous case studies. It can be skipped for this discussion; refer directly to the summary of observations from EDA.

Overview of the dataset

Let's start by importing libraries we need.

View the first 5 rows of the dataset.

Check data types and number of non-null values for each column.

Summary of the dataset

Number of unique values in each column

We will fill the missing values in the TypeofContact column with 'Other', as the exact values for that category are not known.
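A minimal sketch of this fill, assuming the data lives in a pandas DataFrame. The toy frame `df` below is made up for illustration and is not the actual dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset (values are illustrative only).
df = pd.DataFrame(
    {"TypeofContact": ["Self Inquiry", np.nan, "Company Invited", np.nan]}
)

# Replace missing contact types with an explicit 'Other' category.
df["TypeofContact"] = df["TypeofContact"].fillna("Other")

print(df["TypeofContact"].value_counts())
```

Keeping the missing rows under their own label preserves them for modeling rather than guessing which of the two known categories they belong to.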

Number of observations in each category

Univariate Analysis

Generating a log-transformed column, lg_DurationOfPitch, to correct the skewness in the distribution
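The transformation might look like the sketch below. It assumes the source column is `DurationOfPitch` (from the data description) and uses the natural log; the sample durations are invented to show a long right tail.

```python
import numpy as np
import pandas as pd

# Invented durations with one extreme value, mimicking a right-skewed column.
df = pd.DataFrame({"DurationOfPitch": [6.0, 8.0, 9.0, 14.0, 30.0, 127.0]})

# The natural log compresses the right tail.
# np.log1p would be the safer choice if zeros could occur in the column.
df["lg_DurationOfPitch"] = np.log(df["DurationOfPitch"])

# Compare skewness before and after the transformation.
print(df["DurationOfPitch"].skew(), df["lg_DurationOfPitch"].skew())
```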

EDA

Function to create barplots that indicate percentage for each category

Bivariate analysis

MaritalStatus v/s ProdTaken

SUMMARY OF EDA

Data Cleaning:

Observations from EDA:

Split the dataset
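One way to do the split is sketched below. The 75/25 ratio, the `random_state`, and the toy data are assumptions for illustration. Stratifying on `ProdTaken` keeps the share of purchasers the same in both parts, which matters when only a minority of customers buy.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: 40 rows with a 20% positive rate (illustrative values only).
df = pd.DataFrame(
    {
        "Age": list(range(25, 65)),
        "Passport": [0, 1] * 20,
        "ProdTaken": [1, 0, 0, 0, 0] * 8,
    }
)

X = df.drop(columns=["ProdTaken"])  # features
y = df["ProdTaken"]                 # target

# stratify=y preserves the class ratio in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)

print(y_train.mean(), y_test.mean())
```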

Building Models

Model evaluation criterion

Model can make wrong predictions as:

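  1. Predicting that a customer will purchase a package when they will not (false positive), which wastes marketing effort.
  2. Predicting that a customer will not purchase a package when they would have (false negative), which results in loss of revenue.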

Which case is more important?

How to reduce this loss, i.e., how to reduce false negatives?

First, let's create functions to calculate different metrics and confusion matrix so that we don't have to use the same code repeatedly for each model.
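A dependency-free sketch of such helpers follows. The function names are hypothetical, and the notebook itself may well use the equivalents in `sklearn.metrics` (for example `confusion_matrix` and `recall_score`). Labels are assumed to be coded 0/1 as in `ProdTaken`.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels coded 0/1."""
    counts = Counter(zip(y_true, y_pred))
    return counts[(0, 0)], counts[(0, 1)], counts[(1, 0)], counts[(1, 1)]


def classification_scores(y_true, y_pred):
    """Return accuracy, recall, precision and F1 as a dict."""
    tn, fp, fn, tp = confusion_counts(y_true, y_pred)
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # penalizes false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0     # penalizes false positives
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


# Made-up labels and predictions for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
scores = classification_scores(y_true, y_pred)
print(confusion_counts(y_true, y_pred), scores)
```

Calling the same function on the training and test predictions of each model makes it easy to spot overfitting, in particular a drop in recall on the test set.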

Model building - Decision Tree, Bagging Classifier, Random Forest

Decision Tree Model

Visualizing the Decision Tree

According to the decision tree model, Age is the most important variable for predicting whether a customer purchases a package.

Using GridSearch for hyperparameter tuning of our tree model

Hyperparameter Tuning
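A hedged sketch of grid search for the decision tree is given here. The parameter grid, the synthetic data from `make_classification`, and the choice of recall as the scoring metric are assumptions for illustration, not the notebook's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the real features and target.
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.82, 0.18], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# Hypothetical grid: limit depth and leaf size to curb overfitting.
param_grid = {
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5, 10],
}

grid = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=1),
    param_grid,
    scoring="recall",  # optimize for fewer false negatives
    cv=5,
)
grid.fit(X_train, y_train)

dtree_tuned = grid.best_estimator_  # refit on the full training set
print(grid.best_params_, grid.best_score_)
```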

Confusion Matrix - decision tree with tuned hyperparameters

Visualizing the Hyperparameter-Tuned Decision Tree

Plotting the feature importance of each variable

Bagging Classifier

Bagging Classifier with weighted decision tree
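As an illustration, a bagging ensemble over a class-weighted tree could be set up as follows. The weights `{0: 0.2, 1: 0.8}`, the number of estimators, and the synthetic data are assumptions. The base estimator is passed positionally because its keyword name differs across scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the real dataset.
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.82, 0.18], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# Illustrative weights: errors on the minority class (1) cost more.
weighted_tree = DecisionTreeClassifier(class_weight={0: 0.2, 1: 0.8}, random_state=1)

# Base estimator passed positionally (keyword is `estimator` in recent
# scikit-learn releases and `base_estimator` in older ones).
bagging_wt = BaggingClassifier(weighted_tree, n_estimators=50, random_state=1)
bagging_wt.fit(X_train, y_train)

# Compare recall on train and test to check generalization.
train_recall = recall_score(y_train, bagging_wt.predict(X_train))
test_recall = recall_score(y_test, bagging_wt.predict(X_test))
print(train_recall, test_recall)
```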

The bagging classifier with a weighted decision tree gives very good accuracy and precision but does not generalize well to the test data in terms of recall.

Hyperparameter Tuning

Bagging Classifier

Some of the important hyperparameters available for bagging classifier are:

Tuning Bagging Classifier

Random Forest Model

Random forest with class weights
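A sketch under similar assumptions: `class_weight="balanced"` is one built-in option that reweights classes inversely to their frequency (an explicit dict also works). The data is synthetic and the feature names are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the real dataset.
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.82, 0.18], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# 'balanced' weights each class inversely to its frequency in y_train.
rf_wt = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=1
)
rf_wt.fit(X_train, y_train)

# Impurity-based importances, largest first (placeholder feature names).
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
ranked = sorted(
    zip(feature_names, rf_wt.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked[:3])
```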

Hyperparameter Tuning

Comparing all the models

Boosting Models

AdaBoost Classifier

Hyperparameter Tuning

Gradient Boosting Classifier

Hyperparameter Tuning

XGBoost Classifier

Hyperparameter Tuning

Stacking Model

Now, let's build a stacking model with the tuned models - decision tree, random forest, and gradient boosting, then use XGBoost to get the final prediction.
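The stacking setup can be sketched with scikit-learn's `StackingClassifier`. To keep the example free of extra dependencies, a `GradientBoostingClassifier` stands in for XGBoost as the final estimator; with the xgboost package installed, `xgboost.XGBClassifier` could be passed as `final_estimator` instead. The base models here are untuned and the data is synthetic, both assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the real dataset.
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.82, 0.18], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# Base learners; in the notebook these would be the tuned models.
estimators = [
    ("decision_tree", DecisionTreeClassifier(max_depth=5, random_state=1)),
    ("random_forest", RandomForestClassifier(n_estimators=50, random_state=1)),
    ("gradient_boosting", GradientBoostingClassifier(random_state=1)),
]

# Stand-in for XGBoost (assumption): combines the base learners' predictions.
final_estimator = GradientBoostingClassifier(n_estimators=50, random_state=1)

stacking = StackingClassifier(
    estimators=estimators, final_estimator=final_estimator, cv=5
)
stacking.fit(X_train, y_train)

test_pred = stacking.predict(X_test)
print(stacking.score(X_test, y_test))
```

The `cv=5` setting means the final estimator is trained on out-of-fold predictions of the base learners, which limits leakage from the training labels.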

Comparing all models

Feature importance of XGBoost classifier (compared to Stacking Classifier)

Actionable Insights & Recommendations